Creating Training Corpora for NLG Micro-Planning

Authors

  • Claire Gardent
  • Anastasia Shimorina
  • Shashi Narayan
  • Laura Perez-Beltrachini
Abstract

In this paper, we present a novel framework for semi-automatically creating linguistically challenging micro-planning data-to-text corpora from existing Knowledge Bases. Because our method pairs data of varying size and shape with texts ranging from simple clauses to short texts, a dataset created using this framework provides a challenging benchmark for micro-planning. Another feature of this framework is that it can be applied to any large-scale knowledge base and can therefore be used to train and learn KB verbalisers. We apply our framework to DBpedia data and compare the resulting dataset with that of Wen et al. (2016). We show that while Wen et al.’s dataset is more than twice as large as ours, it is less diverse both in terms of input and in terms of text. We thus propose our corpus generation framework as a novel method for creating challenging datasets from which NLG models can be learned that are capable of handling the complex interactions occurring during micro-planning between lexicalisation, aggregation, surface realisation, referring expression generation and sentence segmentation. To encourage researchers to take up this challenge, we recently made available a dataset created using this framework in the context of the WEBNLG shared task.

Similar resources

GenNext: A Consolidated Domain Adaptable NLG System

We introduce GenNext, an NLG system designed specifically to adapt quickly and easily to different domains. Given a domain corpus of historical texts, GenNext allows the user to generate a template bank organized by semantic concept via derived discourse representation structures in conjunction with general and domain-specific entity tags. Based on various features collected from the training c...

Enabling text readability awareness during the micro planning phase of NLG applications

Currently, there is a lack of text complexity awareness in NLG systems. Much attention has been given to text simplification. However, based upon results of an experiment, we unveiled that sophisticated readers in fact would rather read more sophisticated text, instead of the simplest text they could get. Therefore, we propose a technique that considers different readability levels during the m...

Statistical Generation: Three Methods Compared and Evaluated

Statistical NLG has largely meant n-gram modelling which has the considerable advantages of lending robustness to NLG systems, and of making automatic adaptation to new domains from raw corpora possible. On the downside, n-gram models are expensive to use as selection mechanisms and have a built-in bias towards shorter realisations. This paper looks at treebank-training of generators, an altern...

Recent Advances in Natural Language Generation: A Survey and Classification of the Empirical Literature

Natural Language Generation (NLG) is defined as the systematic approach for producing human understandable natural language text based on nontextual data or from meaning representations. This is a significant area which empowers human-computer interaction. It has also given rise to a variety of theoretical as well as empirical approaches. This paper intends to provide a detailed overview and a ...

A Statistical NLG Framework for Aggregated Planning and Realization

We present a hybrid natural language generation (NLG) system that consolidates macro and micro planning and surface realization tasks into one statistical learning process. Our novel approach is based on deriving a template bank automatically from a corpus of texts from a target domain. First, we identify domain specific entity tags and Discourse Representation Structures on a per sentence basi...

Publication date: 2017